Homework 10

Instructions

General

  • Please submit your responses as a nicely formatted notebook (.ipynb) file.
  • If your answer includes many decimal places (e.g., 3.14159265358979), please round it to a reasonable number of decimals for readability (typically 3 to 4).
  • For ease of marking, avoid showing the outputs of unnecessary values.
  • Make sure your code runs without errors and shows all required outputs.
  • Good coding style matters: disorganized, redundant, or poorly structured code may lose marks.

Plots, Packages, and Functions:

  • All plots must be made with ggplot2 and include clear, descriptive axis titles rather than default column names.
  • Only the following packages are permitted:
    • tidyverse

Requirements:

  • Unless otherwise specified, set any trimming you need to do at 20% and use an \(\alpha = 0.05\).
  • Unless the instructions explicitly state otherwise (e.g., “assume the data are normally distributed”, “build an ordinary least-squares model”), you are responsible for checking whether the method’s assumptions are reasonable and for using robust approaches when they are not.
    • If classical test assumptions are not violated, use the classical test.
  • Do not remove/handle outliers unless explicitly asked to.


Question 1

The type of influencer your advertising company uses to brainwash users is categorized into 4 levels: Macro, Mega, Micro, and Nano. Load DataDrivenMarketing.csv and remove any rows that contain NA values in the Sales, TV, or Influencer columns only. Report how many rows are left after doing this.

There are many different ways to do this. Here is a simple method…

library(tidyverse)

# Load data
ddm <- read_csv("data/DataDrivenMarketing.csv") |> 
  drop_na(Sales, TV, Influencer)

nrow(ddm)
[1] 4539
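As noted above, there are other ways to get the same result. One alternative sketch uses filter() with if_all() (both tidyverse; the file path matches the one used above):

```r
library(tidyverse)

# Keep only rows where Sales, TV, and Influencer are all non-NA;
# equivalent to drop_na(Sales, TV, Influencer)
ddm_alt <- read_csv("data/DataDrivenMarketing.csv") |>
  filter(if_all(c(Sales, TV, Influencer), \(x) !is.na(x)))

nrow(ddm_alt)  # same row count as the drop_na() approach
```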


Question 2

Using the NA-removed data, build an ordinary least-squares regression model with both TV and Influencer as predictors of Sales. Is each of its coefficients significantly different from 0? What is the multiple \(R^2\) of this model?

# Dummy coding
ddm$Mac <- ifelse(ddm$Influencer == "Macro", 1, 0)
ddm$Meg <- ifelse(ddm$Influencer == "Mega", 1, 0)
ddm$Mic <- ifelse(ddm$Influencer == "Micro", 1, 0)

# Model
mod2 <- lm(Sales ~ TV + Mac + Meg + Mic, data = ddm)
mod2sum <- summary(mod2)
rsq <- round(mod2sum$r.squared, 3)

# Results
mod2sum$coefficients
cat("\n")
rsq
              Estimate Std. Error   t value      Pr(>|t|)
(Intercept) 189.523112 1.75647337 107.89979  0.000000e+00
TV            2.913543 0.03394066  85.84225  0.000000e+00
Mac          72.191913 2.43900997  29.59886 3.444084e-176
Meg          45.080510 2.16053220  20.86547  2.139665e-92
Mic          29.603079 1.96000432  15.10358  2.524918e-50

[1] 0.812

All the coefficients are significantly different from 0.

\(R^2 = 0.812\)
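The manual ifelse() dummy coding above can also be reproduced by letting lm() build the dummy variables from a factor. Releveling so “Nano” is the baseline (matching the coding above, where Nano is the omitted group) is a sketch of this approach:

```r
# Let lm() create treatment-coded dummies automatically; relevel() sets
# "Nano" as the reference level so the coefficients match the manual coding
ddm$Influencer <- relevel(factor(ddm$Influencer), ref = "Nano")
mod2_alt <- lm(Sales ~ TV + Influencer, data = ddm)

summary(mod2_alt)$r.squared  # identical fit and R^2 to the manual coding
```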


Question 3

Conduct an F-test to evaluate whether Influencer significantly improved the fit of the model over one with just TV as a predictor. Which is the preferred model?

  • Use of anova() is prohibited.

  • Report the F-statistic, degrees of freedom, and p-value.
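For reference, the statistic computed in the code below is the change-in-\(R^2\) F-test comparing the full model (TV + Influencer, \(p_{\text{full}} = 4\) predictors) against the reduced model (TV only):

\[
F = \frac{(R^2_{\text{full}} - R^2_{\text{reduced}}) / p_{\text{diff}}}{(1 - R^2_{\text{full}}) / (N - p_{\text{full}} - 1)}
\]

with \(df_1 = p_{\text{diff}}\) and \(df_2 = N - p_{\text{full}} - 1\).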

mod1 <- lm(Sales ~ TV, data = ddm)
mod1sum <- summary(mod1)

N <- nrow(ddm)
P_mod2 <- 4
P_diff <- 4 - 1
Rsq_diff <- mod2sum$r.squared - mod1sum$r.squared

# F-statistic to compare the models
F_stat <- ((N - P_mod2 - 1) / P_diff) * (Rsq_diff / (1 - mod2sum$r.squared))

# P-Value
df1 <- P_diff
df2 <- N - P_mod2 - 1
p <- pf(F_stat, df1, df2, lower.tail = FALSE)

# Results
paste0("F = ", round(F_stat, 5))
paste0("df1 = ", df1, "; df2 = ", df2)
paste0("p = ", round(p, 5))
[1] "F = 294.65871"
[1] "df1 = 3; df2 = 4534"
[1] "p = 0"

Adding Influencer significantly improves the model (p < 0.001), so the full model is the preferred choice.
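The displayed “p = 0” is an artifact of round(): the p-value is smaller than 0.000005, so rounding to 5 decimals yields 0. A sketch of how to display its magnitude instead:

```r
# signif() keeps significant digits regardless of magnitude, and format()
# can force scientific notation, so neither collapses a tiny p-value to 0
signif(p, 3)
format(p, digits = 3, scientific = TRUE)
```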


Question 4

Using the preferred model from the previous question, create a plot of the residuals to evaluate homogeneity of variance. Is the assumption reasonable?

ggplot(ddm, aes(x = mod2$fitted.values, y = mod2$residuals)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = 3, colour = 'red') +
  labs(
    x = "Fitted Values",
    y = "Residuals"
  )

Yes, the assumption is reasonable.


Question 5

Using the preferred model, evaluate whether the residuals are normally distributed. Is the assumption reasonable?

ggplot(ddm, aes(sample = mod2$residuals)) +
  stat_qq() +
  stat_qq_line() +
  labs(
    x = "Theoretical Quantiles",
    y = "Sample Quantiles"
  )

Yes, the assumption is reasonable.


Question 6

Some archaeologists theorize that ancient Egyptians interbred with several different immigrant populations over thousands of years. To see if there is any indication of changes in body structure that might have resulted, they measured skulls of male Egyptians from 5 different epochs.

  • Thomson and Randall-Maciver, Ancient Races of the Thebaid, Oxford: Oxford University Press, 1905.

The data can be found in “SkullsComplete.csv”. The column “mb” measures the maximum breadth of the skull in millimetres.

For the remaining questions, we will not be using the columns “bh”, “bl” or “nh”. Remove them from the dataframe and display the first 10 rows.

There are many different ways to remove unnecessary columns; the easiest (with the tidyverse) is to use the select() function and list which columns you want to keep.

library(tidyverse)

skulls <- read_csv("data/SkullsComplete.csv") |> 
  select(epoch, mb)

skulls[1:10, ]
# A tibble: 10 × 2
   epoch      mb
   <chr>   <dbl>
 1 c4000BC   131
 2 c4000BC   125
 3 c4000BC   131
 4 c4000BC   119
 5 c4000BC   136
 6 c4000BC   138
 7 c4000BC   139
 8 c4000BC   125
 9 c4000BC   131
10 c4000BC   134


Question 7

Create a barplot of the mean maximal breadth measured for each epoch in the “SkullsComplete.csv” data. Give the plot error bars with classic 95% confidence intervals.

  • Order the epochs so that, left to right, they go from earliest (\(4000_\text{BC}\)) to latest (\(150_\text{AD}\)). Note that years classified as “B.C.” count backwards. E.g., \(3300_\text{BC}\) is more recent than \(4000_\text{BC}\).

  • Adjust the y-axis scale so that it goes from 120 at the bottom to 140 at the top.

  • Make the bars interesting colours (don’t use ggplot’s default colours).

  • Display a data frame that shows the group means, as well as the lower and upper boundaries of the confidence intervals. Do not display any other statistics in the data frame.
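The classic 95% confidence interval computed in the code below is

\[
\bar{x} \pm t_{\alpha/2,\; n-1} \cdot \frac{s}{\sqrt{n}}
\]

where \(s\) is the sample standard deviation of each group and \(t_{\alpha/2,\, n-1}\) is the critical t value for that group’s degrees of freedom.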


#Factoring and ordering epoch
skulls$epoch <- factor(skulls$epoch,
  levels = c(
    "c4000BC",
    "c3300BC",
    "c1850BC",
    "c200BC",
    "cAD150"
  )
)
# Getting statistics for plot
plot_data <- skulls |>
  group_by(epoch) |>
  summarise(
    n = length(mb),
    m = mean(mb),
    df = n - 1,
    alpha = 0.05,
    t_crit = abs(qt(alpha / 2, df = df)),
    se = sd(mb) / sqrt(n),
    moe = t_crit * se,
    ci_bot = m - moe,
    ci_top = m + moe
  )

plot_data |> select(epoch, m, ci_bot, ci_top)
# A tibble: 5 × 4
  epoch          m   ci_bot   ci_top
  <fct>      <dbl>    <dbl>    <dbl>
1 c4000BC 131.3667 129.4514 133.2820
2 c3300BC 132.3667 130.5706 134.1628
3 c1850BC 134.4667 133.1667 135.7666
4 c200BC  135.5    134.0365 136.9635
5 cAD150  136.1667 134.1688 138.1645
# Colour palette
palette <- hcl.colors(n = 7, palette = "Green-Brown")[2:6]

# Plot
ggplot(plot_data, aes(x = epoch, y = m)) +
  geom_bar(stat = "identity", aes(fill = epoch), colour = "black") +
  geom_errorbar(
    stat = "identity",
    aes(ymin = ci_bot, ymax = ci_top),
    width = .25
  ) +

  # Y-axis Scale Adjust
  coord_cartesian(ylim = c(120, 140)) +
  xlab("Epoch") +
  ylab("Maximum Skull Breadth (mm)") +
  theme(legend.position = "none") +

  # Colours
  scale_fill_manual(values = palette)


Question 8

Using the “SkullsComplete.csv”, create an ordinary least-squares linear model predicting maximal breadth (mb) with coefficients that make the following comparisons:

  • \(b_0 = 4000_{\text{BC}}\)
  • \(b_1 = 3300_{\text{BC}} - 4000_{\text{BC}}\)
  • \(b_2 = 1850_{\text{BC}} - 4000_{\text{BC}}\)
  • \(b_3 = 200_{\text{BC}} - 4000_{\text{BC}}\)
  • \(b_4 = 150_{\text{AD}} - 4000_{\text{BC}}\)

Report the model’s formula using the obtained coefficients.

  • e.g. \(\hat{y} = \text{(value)} + \text{(value)} x_1 + \text{(value)} x_2 \dots\)



Given the methods shown in class, there are two ways to achieve this: The first way is to use the ifelse() function to create your contrasts/dummy coding:

#Dummy Variables
skulls$cont1 <- ifelse(skulls$epoch == "c3300BC", 1, 0)
skulls$cont2 <- ifelse(skulls$epoch == "c1850BC", 1, 0)
skulls$cont3 <- ifelse(skulls$epoch == "c200BC", 1, 0)
skulls$cont4 <- ifelse(skulls$epoch == "cAD150", 1, 0)
#Model
skull_mod <- lm(mb ~ cont1 + cont2 + cont3 + cont4, data = skulls)
# skull_mod

Formula:

\(\hat{y} = 131.367 + 1 \cdot x_1 + 3.1 \cdot x_2 + 4.133 \cdot x_3 + 4.8 \cdot x_4\)



The second way is to manually set the contrasts in R:

#Look at the ordering of the factor levels
levels(skulls$epoch)
[1] "c4000BC" "c3300BC" "c1850BC" "c200BC"  "cAD150" 
# Create n-1 (4) Contrasts
cont1 <- c(0, 1, 0, 0, 0)
cont2 <- c(0, 0, 1, 0, 0)
cont3 <- c(0, 0, 0, 1, 0)
cont4 <- c(0, 0, 0, 0, 1)

contrasts(skulls$epoch) <- cbind(cont1, cont2, cont3, cont4)
contrasts(skulls$epoch)
        cont1 cont2 cont3 cont4
c4000BC     0     0     0     0
c3300BC     1     0     0     0
c1850BC     0     1     0     0
c200BC      0     0     1     0
cAD150      0     0     0     1
#Model
skull_mod <- lm(mb ~ cont1 + cont2 + cont3 + cont4, data = skulls)

Formula:

\(\hat{y} = 131.367 + 1 \cdot x_1 + 3.1 \cdot x_2 + 4.133 \cdot x_3 + 4.8 \cdot x_4\)
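Since “c4000BC” is the first factor level, R’s default treatment contrasts already make it the baseline, so the same coefficients can also be obtained without any manual coding. A sketch of this shortcut:

```r
# lm() applies treatment contrasts to a factor by default, with the first
# level ("c4000BC") as the baseline -- identical to the manual coding above
skull_mod_alt <- lm(mb ~ epoch, data = skulls)

coef(skull_mod_alt)  # same estimates as the manually coded model
```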


Question 9

Is the omnibus F-test from the previous question’s linear model statistically significant at an \(\alpha = 0.05\)? Report its value, degrees of freedom, and p-value.

  • Ensure that the F-statistic and p-value are displayed to 6 decimal places.
Hint

To get accurate results, you will need to extract the F-stat values from the summary output.

# F-Stat, df1, and df2
mod_sum <- summary(skull_mod)

# P-value
p_val <- pf(mod_sum$fstatistic[1],
  df1 = mod_sum$fstatistic[2],
  df2 = mod_sum$fstatistic[3],
  lower.tail = FALSE
)

# Results
paste0("F_stat = ", round(mod_sum$fstatistic[1], 6))
paste0("df 1 = ", mod_sum$fstatistic[2])
paste0("df 2 = ", mod_sum$fstatistic[3])
paste0("p_val = ", round(p_val, 6))
[1] "F_stat = 5.954613"
[1] "df 1 = 4"
[1] "df 2 = 145"
[1] "p_val = 0.000183"

Yes, it is significant.


Question 10

What do the results you displayed in the previous question tell you?

  • There is at least one significant difference between the 5 different epochs.


Question 11

Based on the planned contrasts/comparisons you used to evaluate the “SkullsComplete.csv” data, between which epochs was there a significant difference?

round(mod_sum$coefficients, 4)
            Estimate Std. Error  t value Pr(>|t|)
(Intercept) 131.3667     0.8389 156.6006   0.0000
cont1         1.0000     1.1863   0.8429   0.4007
cont2         3.1000     1.1863   2.6131   0.0099
cont3         4.1333     1.1863   3.4841   0.0007
cont4         4.8000     1.1863   4.0461   0.0001

There is a significant difference between the following epochs:

  • \(1850_{\text{BC}}\) & \(4000_{\text{BC}}\)

  • \(200_{\text{BC}}\) & \(4000_{\text{BC}}\)

  • \(150_{\text{AD}}\) & \(4000_{\text{BC}}\)


Question 12

For the “SkullsComplete.csv” data, assume all the classic assumptions of an OLS model are satisfied. Calculate an omnibus F-test manually without using lm() or aov(). Report the following:

  • Grand Mean
  • Total Sum of Squares
  • Model Sum of Squares
  • Residual Sum of Squares
  • Model Mean Squares
  • Residual Mean Squares
  • Multiple \(R^2\)
  • F statistic
  • Degrees of Freedom
  • p-value

Round displayed outputs to 6 decimal places for everything except the degrees of freedom.

Note

These results should be identical to the earlier F-statistic and p-value you obtained. If you get a different result, you have made a mistake somewhere.

# Grand mean
grand_m <- mean(skulls$mb)

# Total sum of squares
SST <- sum((skulls$mb - grand_m)^2)

# Model sum of squares
SSM <- sum(plot_data$n * (plot_data$m - grand_m)^2)

# Sum of squared residuals
SSR <- SST - SSM

# R^2
R_sq <- SSM/SST

# Degrees of Freedom
df_t <- nrow(skulls) - 1
df_m <- 5 - 1
df_r <- df_t - df_m

# Mean Squares
MSm <- SSM / df_m
MSr <- SSR / df_r

# F-stat
F_stat <- MSm / MSr

# P-value
p <- pf(F_stat, df1 = df_m, df2 = df_r, lower.tail = FALSE)

# Results
paste0("Grand Mean = ", round(grand_m, 6))
paste0("Total Sum of Squares = ", round(SST, 6))
paste0("Model Sum of Squares = ", round(SSM, 6))
paste0("Residual Sum of Squares = ", round(SSR, 6))
paste0("Model Mean Squares = ", round(MSm, 6))
paste0("Residual Mean Squares = ", round(MSr, 6))
paste0("Multiple R^2 = ", round(R_sq, 6))
paste0("F = ", round(F_stat, 6))
paste0("df1 = ", df_m, ", df2 = ", df_r)
paste0("p = ", round(p, 6))
[1] "Grand Mean = 133.973333"
[1] "Total Sum of Squares = 3563.893333"
[1] "Model Sum of Squares = 502.826667"
[1] "Residual Sum of Squares = 3061.066667"
[1] "Model Mean Squares = 125.706667"
[1] "Residual Mean Squares = 21.110805"
[1] "Multiple R^2 = 0.141089"
[1] "F = 5.954613"
[1] "df1 = 4, df2 = 145"
[1] "p = 0.000183"


Question 13

Does the model you created for the “SkullsComplete.csv” data violate the normality assumption?


Note: There are two different ways to check this, either is correct.

Option 1: Check the model’s residuals

ggplot(mapping = aes(sample = mod_sum$residuals)) +
  stat_qq(size = 3) +
  stat_qq_line() +
  labs(
    x = "Theoretical Quantiles",
    y = "Sample Quantiles"
  )


Option 2: Check the normality of the individual epoch levels

# Colour palette
palette <- hcl.colors(n = 7, palette = "Green-Brown")[2:6]
# this is for colouring my plot pretty colours

ggplot(skulls, aes(sample = mb)) +
  stat_qq(
    shape = 21,
    colour = "black",
    aes(fill = epoch),
    size = 2
  ) +
  stat_qq_line() +
  facet_wrap(~ epoch, scales = "free", nrow = 1) +
  scale_fill_manual(values = palette) +
  labs(
    x = "Theoretical Quantiles",
    y = "Sample Quantiles"
  ) +
  theme(legend.position = "none")

The data seem to be acceptably normally distributed.
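As an optional numeric complement to either Q-Q plot approach (not required by the question), a Shapiro–Wilk test can be run on the model residuals:

```r
# Shapiro-Wilk test of the residuals; a large p-value is consistent with
# approximate normality (supplementary check only)
shapiro.test(residuals(skull_mod))
```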

Note

If you chose Option 2, there needs to be a separate Q-Q plot for each group because they are independent groups (i.e., the distribution for each may be distinct). You should not collapse them into a single Q-Q plot.